[w2vbert] finetune: make finetune run #2409
base: main
Conversation
Mddct commented on Mar 14, 2024 (edited)
- small batch size + fp32 training
- fsdp in another PR ([train_engine] support fsdp #2412), then run with a relatively large batch size + fp32/fp16 (a rough FSDP sketch follows this list)
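A rough sketch only, assuming an already-initialized torch.distributed process group; the real integration is what PR #2412 adds and may differ in details.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def wrap_with_fsdp(model: torch.nn.Module, fp16: bool = False) -> FSDP:
    # Shard parameters, gradients and optimizer state across ranks so a
    # relatively large batch fits where plain DDP would run out of memory.
    mp = MixedPrecision(param_dtype=torch.float16) if fp16 else None
    return FSDP(model, mixed_precision=mp)
```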
Zhou, is this already usable for wenet training, just missing a pipeline? I'll grab it and try it right away. Nice work!
Training works, but the original model only downsamples by 2x, so DDP mode uses a lot of GPU memory. You can train with DeepSpeed first.
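For reference, a minimal sketch of the DeepSpeed route suggested above, using ZeRO stage 2 to shard optimizer state and gradients; the model and config values here are illustrative stand-ins, not wenet's actual integration, and it assumes the usual DeepSpeed distributed launch.

```python
import torch
import deepspeed

model = torch.nn.Linear(80, 1024)  # stand-in for the wenet ASR model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # illustrative value
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 2},     # shard optimizer states/gradients
    "fp16": {"enabled": False},            # keep fp32 as discussed above
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```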
Those few are pretraining-related and don't affect finetuning. I've been looking at this PR over the past couple of days as well, checking what diffs remain.
@fclearner You can run with the current code, with the after norm in the base encoder removed. I ran it overnight on my side (not finished yet) and it looks like it is converging. The config I used:

dataset_conf:
  filter_conf:
    max_length: 4800
    min_length: 10
    token_max_length: 100
    token_min_length: 1
  resample_conf:
    resample_rate: 16000
  speed_perturb: true
  feats_type: 'w2vbert_fbank'
  fbank_conf:
    num_mel_bins: 80
    frame_shift: 10
    frame_length: 25
    dither: 0.1
  spec_aug: true
  spec_aug_conf:
    num_t_mask: 2
    num_f_mask: 2
    max_t: 50
    max_f: 30
  shuffle: true
  shuffle_conf:
    shuffle_size: 1500
  sort: true
  sort_conf:
    sort_size: 500  # sort_size should be less than shuffle_size
  batch_conf:
    batch_type: dynamic
    max_frames_in_batch: 8000

decoder_conf:
  attention_heads: 16
  dropout_rate: 0.1
  linear_units: 4096
  num_blocks: 6
  positional_dropout_rate: 0.1
  self_attention_dropout_rate: 0.0
  src_attention_dropout_rate: 0.0
  gradient_checkpointing: true
  use_sdpa: true

encoder: conformer
encoder_conf:
  activation_type: swish
  attention_dropout_rate: 0.0
  gradient_checkpointing: true
  attention_heads: 16
  causal: true
  cnn_module_kernel: 31
  cnn_module_norm: layer_norm
  conv_bias: false
  dropout_rate: 0.1
  input_layer: stack_n_frames
  linear_units: 4096
  normalize_before: true
  num_blocks: 24
  output_size: 1024
  pos_enc_layer_type: w2vbert_pos
  positional_dropout_rate: 0.0
  selfattention_layer_type: shaw_rel_selfattn
  static_chunk_size: -1
  use_dynamic_chunk: false
  use_dynamic_left_chunk: false
  use_sdpa: true

grad_clip: 5
input_dim: 80
log_interval: 100
save_interval: 3000
max_epoch: 80

model: asr_model
model_conf:
  ctc_weight: 0.3
  length_normalized_loss: false
  lsm_weight: 0.1

optim: adam
optim_conf:
  lr: 0.0001
  # lr: [0.001, 0.0001]
  # lr: [0.001, 0.001, 0.0001]
  # lr: [0.001, 0.00001]
  # lr: 0.01
  # modules: ['ctc', 'decoder']
  # modules: ['ctc']

scheduler: warmuplr
scheduler_conf:
  # warmup_steps: [1500, 1500, 5000]
  warmup_steps: 5000
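The commented-out `lr: [...]` and `modules: [...]` options hint at the multi-lr setup discussed further down. How wenet wires these options internally may differ; as a generic PyTorch illustration, the idea is simply per-module parameter groups, e.g. a larger learning rate for the freshly initialized ctc/decoder than for the pretrained encoder.

```python
import torch


def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Split parameters by module prefix: pretrained encoder vs. new heads.
    new_modules = ("ctc", "decoder")
    new_params, pretrained_params = [], []
    for name, p in model.named_parameters():
        (new_params if name.startswith(new_modules) else pretrained_params).append(p)
    return torch.optim.Adam([
        {"params": pretrained_params, "lr": 1e-4},  # pretrained w2vbert encoder
        {"params": new_params, "lr": 1e-3},         # randomly initialized ctc/decoder
    ])
```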
Great, I'll try it today. Much appreciated!
Zhou, is the decoder you train here loaded from a pretrained model? My feeling is that with little data the decoder ends up undertrained, though the CTC results do look normal now. Let me note down the issue I hit in training:
I had set normalize_before to false earlier, thinking that would bypass the encoder's self.after_norm, but doing that also turns normalize_before off inside every encoder_layer.
A standard conformer implementation shouldn't have the final after norm, i.e. the very last norm at the top level (it is the transformer pre-norm setup that needs it). The idea behind adding a decoder is to swap in whisper's decoder or an LLM decoder later; for now you can train with the multi-lr approach.
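A minimal sketch of that suggestion: keep `normalize_before: true` so every encoder layer stays pre-norm, and drop only the top-level final norm. The `after_norm` attribute name follows the discussion above and should be checked against the actual wenet encoder class.

```python
import torch


def drop_final_after_norm(encoder: torch.nn.Module) -> None:
    # Replace the encoder's final LayerNorm with a no-op instead of flipping
    # normalize_before, which would also switch every encoder layer to post-norm.
    if hasattr(encoder, "after_norm"):
        encoder.after_norm = torch.nn.Identity()
```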
Got it. I'm already using multi lr and it works great (kudos). My feeling is that attaching a decoder from another model will in theory suffer from feature mismatch, so it still needs a lot of data to train, otherwise it stays undertrained.
You can use an adapter; some of Google's modular speech papers graft models together exactly this way, and there are similar approaches on the LLM side. By the way, attaching whisper is only for recognition; attaching an LLM takes the speechLLM route, which enables speech understanding and more.
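As an illustration of the grafting idea (not any specific paper's recipe), an adapter can simply downsample the speech encoder output in time and project it into the decoder/LLM embedding space. The dimensions below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


class SpeechAdapter(torch.nn.Module):
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Strided convolution shortens the sequence; linear layer refines the projection.
        self.conv = torch.nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)
        self.proj = torch.nn.Linear(llm_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim) -> (batch, time // stride, llm_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(F.gelu(x))
```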
Very enlightening. LLaMA Pro's block-expansion scheme also seems worth trying. I'm curious whether, when finetuning whisper, swapping the downsampling layer, freezing the other layers, and adding a few encoder blocks at the front could preserve generalization. If finetuning can solve the upstream/downstream mismatch, we could keep reusing pretrained models for free, haha. speechLLM is definitely the future; learning along with wenet (kudos).
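A rough sketch of that freeze-and-extend idea, with hypothetical module-name prefixes; whether this actually preserves generalization is exactly the open question raised above.

```python
import torch


def freeze_except(model: torch.nn.Module,
                  trainable_prefixes=("encoder.embed", "encoder.extra_blocks")) -> None:
    # Freeze everything, then leave only the swapped-in downsampling/input layer
    # and the newly added encoder blocks trainable.
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
```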
Zhou, you are our hero.
The results don't look great. I trained on a few hundred hours of Cantonese and it was worse than whisper+LoRA finetuning by five or six points of CER. Strangely, finetuning the hubert model at https://huggingface.co/TencentGameMate/chinese-hubert-large works better than w2vbert2; maybe w2vbert2 just wasn't pretrained on enough Chinese data.
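For reference, a minimal sketch of loading that hubert checkpoint through Hugging Face transformers for a comparison run; the finetuning pipeline itself (CTC head, data, trainer) is not shown.

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("TencentGameMate/chinese-hubert-large")
model.eval()

with torch.no_grad():
    wav = torch.zeros(1, 16000)              # one second of dummy 16 kHz audio
    feats = model(wav).last_hidden_state     # (1, frames, 1024) encoder features
```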
Hugging Face has finetune code for w2vbert2. If you have GPUs and time, could you help run aishell and see what the Hugging Face version reaches? I'll debug in a few days and align with the Hugging Face implementation first; I'm not sure yet whether there is any inconsistency in this implementation.
Sure, following Zhou's lead on the experiments, I'll run the Hugging Face version first.
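A minimal sketch of the Hugging Face w2v-bert-2.0 finetuning entry point mentioned above, assuming transformers >= 4.37; the vocab size is a placeholder that has to match the CTC tokenizer (e.g. the aishell character set), and the full recipe adds a tokenizer, data collator and Trainer on top.

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
# The pretrained checkpoint has no CTC head; it is initialized randomly here.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=4233,            # placeholder: must equal your tokenizer's vocab size
    ctc_loss_reduction="mean",
)
```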
For the GPU memory issue, consider #2550 first.
Zhou, I've been running experiments with large models recently. My feeling is that the adapter also needs a lot of data to train; after all, many of the LLM tokens it maps to are byte-pair tokens.
It does. See the speech part of LLaMA 3.1 for reference.